Enable session roaming across multiple server instances #1519
Add session roaming support to StreamableHTTPSessionManager, allowing
sessions to move freely between server instances without requiring
sticky sessions. This enables true horizontal scaling and high
availability for stateful MCP servers.
When a request arrives with a session ID not found in local memory,
the presence of an EventStore allows creating a transport for that
session. EventStore serves dual purposes: storing events (existing)
and proving session existence (new). This eliminates the need for
separate session validation storage.
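The dual-purpose check described above can be sketched as follows. This is an illustrative simplification, not the SDK's actual code; the names `SessionManagerSketch`, `can_accept`, and `_server_instances` are stand-ins modeled on the PR description.

```python
# Sketch of the session-roaming decision: a session is serviceable if it was
# created locally, or if a shared EventStore exists that can prove it existed
# on another instance. Names are illustrative, not the SDK's real API.
class SessionManagerSketch:
    def __init__(self, event_store=None):
        self.event_store = event_store      # shared store; doubles as existence proof
        self._server_instances = {}         # session_id -> transport (local memory)

    def can_accept(self, session_id: str) -> bool:
        if session_id in self._server_instances:
            return True                     # session was created on this instance
        # Roaming: with an EventStore configured, a transport can be
        # recreated for a session that originated elsewhere.
        return self.event_store is not None
```

Without an EventStore the check degrades to the old behavior (local sessions only), which is why the change is backward compatible.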
Changes:
- Add session roaming logic in _handle_stateful_request()
- Extract duplicate server task code into reusable methods
- Update docstrings to document session roaming capability
- Add 8 comprehensive tests for session roaming scenarios
- Add production-ready example with Redis EventStore
- Include Kubernetes and Docker Compose deployment examples
Benefits:
- One store instead of two (EventStore serves both purposes)
- No new APIs or interfaces required
- Minimal code changes (~50 lines in manager)
- 100% backward compatible
- Enables multi-instance deployments without sticky sessions
Example usage:
```python
event_store = RedisEventStore(redis_url="redis://redis:6379")
manager = StreamableHTTPSessionManager(
    app=app,
    event_store=event_store,  # enables session roaming
)
```
Github-Issue: modelcontextprotocol#520
Github-Issue: modelcontextprotocol#692
Github-Issue: modelcontextprotocol#880
Github-Issue: modelcontextprotocol#1350
Change single quotes to double quotes to comply with prettier formatting requirements.
- Add language specifiers to all code blocks
- Fix heading hierarchy (bold text to proper headings)
- Add blank lines after headings for better readability
- Escape underscores in file paths (`__init__.py` -> `\_\_init\_\_.py`)
The transport could be removed from _server_instances by the cleanup task if it crashed immediately after being started. This caused a KeyError when trying to access it from the dictionary. Fixed by keeping a local reference to the transport instead of looking it up again from the dictionary after starting the server task.
Use the @contextlib.asynccontextmanager decorator instead of a manual __aenter__/__aexit__ implementation for the mock_connect functions. Fixes test failures in:
- test_transport_server_task_cleanup_on_exception
- test_transport_server_task_no_cleanup_on_terminated
Add AsyncIterator import and use proper return type annotation for mock_connect functions: AsyncIterator[tuple[AsyncMock, AsyncMock]] instead of Any.
The tests were failing because AsyncMock(return_value=None) caused app.run to complete immediately, which closed the transport streams and triggered cleanup that removed transports from _server_instances before assertions could check for them. Now using mock_app_run that calls anyio.sleep_forever() and blocks until the test context cancels it. This keeps transports alive during the test assertions.
LGTM 👍 There are a few extra .md files, but the logic looks sound.
Do we need this file? It's nice information, but none of the other examples have it.
Most of this information seems to be in README.md
I'm on the fence about this script. None of the examples have bash scripts, but it's a nice DX sanity check.
Motivation and Context
Problem
When deploying MCP servers across multiple instances (Kubernetes pods, Docker containers, worker processes), sessions are tied to the specific instance that created them. This requires sticky sessions at the load balancer level and prevents true horizontal scaling, forcing users into unsatisfactory workarounds.
This limitation is documented in multiple issues: #520 (multi-worker sessions), #692 (session reuse across instances), #880 (horizontal scalability), and #1350 (sticky session problems).
Solution
This PR enables session roaming - allowing sessions to seamlessly move between server instances without requiring sticky sessions. The key insight is that EventStore already serves as proof of session existence.
When a request arrives with a session ID that is not in an instance's local memory and an EventStore is configured, the instance can safely create a transport for that session, treating the stored events as proof that the session exists.
What Changed
- Modified streamable_http_manager.py (~50 lines): session roaming logic in _handle_stateful_request()
- Added comprehensive tests (test_session_roaming.py, 510 lines)
- Added a production-ready example (simple-streamablehttp-roaming/, 13 files)
Why This Approach
Previous Attempts
We explored two other approaches before arriving at this solution:
Custom Session Store (outside the SDK) - Implemented session validation in the application layer. This didn't solve the core problem: every user would have to build their own solution, and because the SDK's internal session dictionary was unchanged, sticky sessions were still required.
SessionStore ABC (in the SDK) - Added a new SessionStore interface requiring both EventStore and SessionStore parameters. While functional, this approach required two separate storage backends and was more complex than necessary. It also meant that omitting either store left the server effectively stateless.
Current Approach: EventStore-Only
The key insight: EventStore already proves sessions existed. If events exist for a session ID, that session must have existed to create those events. No separate SessionStore needed.
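A minimal in-memory sketch of this insight, assuming simplified method names (`store_event`, `session_exists`) that stand in for the SDK's actual EventStore interface; a production deployment would back this with Redis, as in the PR's example.

```python
# Illustrative EventStore doubling as session-existence proof: if any
# events were stored under a session ID, that session must have existed
# on some instance, so no separate SessionStore is needed.
class InMemoryEventStore:
    def __init__(self):
        self._events: dict[str, list] = {}   # session_id -> stored events

    def store_event(self, session_id: str, event) -> None:
        self._events.setdefault(session_id, []).append(event)

    def session_exists(self, session_id: str) -> bool:
        # Existence of events implies the session was created somewhere.
        return session_id in self._events
```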
Benefits:
Usage
Before (Requires Sticky Sessions)
After (No Sticky Sessions Needed)
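A hedged side-by-side sketch of the two configurations, based on the example earlier in this description. The classes are stubbed here so the snippet is self-contained; `supports_roaming` is an illustrative property, not a real SDK attribute, and only the `event_store` parameter differs between the two setups.

```python
# Stub stand-ins for the real StreamableHTTPSessionManager / RedisEventStore,
# showing that roaming is opted into purely by passing an event_store.
class RedisEventStore:
    def __init__(self, redis_url: str):
        self.redis_url = redis_url

class StreamableHTTPSessionManager:
    def __init__(self, app, event_store=None):
        self.app = app
        self.event_store = event_store

    @property
    def supports_roaming(self) -> bool:      # illustrative helper, not SDK API
        return self.event_store is not None

app = object()  # stand-in for your MCP server app

# Before: sessions pinned to the creating instance (sticky sessions required).
sticky = StreamableHTTPSessionManager(app=app)

# After: a shared EventStore lets any instance resume any session.
roaming = StreamableHTTPSessionManager(
    app=app,
    event_store=RedisEventStore(redis_url="redis://redis:6379"),
)
```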
How It Works
How Has This Been Tested?
The included example also demonstrates:
Breaking Changes
None. This is a pure behavior enhancement:
Types of changes
Checklist
Additional context
Related Issues
Closes #520, #692, #880, #1350
This implementation addresses the core limitation described in all these issues: the inability to run stateful MCP servers across multiple instances without sticky sessions.